I am feeling curious, excited, and a bit nervous as well. I expect to learn programming and data analysis in the R language; so far I have been using SPSS. I also want to learn how to analyse big data and find patterns in data that could lead to new findings. I found the course through UEF WebOodi.
Here is the link to my GitHub repository: https://github.com/saranglq/IODS-project
Here is the link to my diary page https://saranglq.github.io/IODS-project/
Describe the work you have done this week and summarize your learning.
I started this week by learning R through the DataCamp platform. I learned how to handle (wrangle) data in R: how to modify existing datasets, merge different datasets, and carry out simple and multiple linear regression analyses. I also learned how to make graphical representations of these analyses.
The dataset given to us was from the "International survey of Approaches to Learning", funded by the Teachers' Academy funding awarded to Kimmo Vehkalahti. The data was collected during 2013-2015.
The original dataset had 183 observations of 60 variables. Variables with prefixes A and C were based on components of parts A and C of ASSIST (Approaches and Study Skills Inventory for Students). Prefix-D variables were based on SATS (Survey of Attitudes Toward Statistics). Items from ASSIST part B were named so that their connection to the relevant dimension (Deep/Surface/Strategic) was maintained.
The dataset had multiple variables, and I was supposed to combine the ASSIST part B items into three composite variables (Deep, Surface, Strategy). I also renamed three other variables (Age, Attitude, Points -> age, attitude, points), then kept only seven variables and excluded the rest. I saved the resulting dataset as lrn2014.csv; it has 7 variables and 166 observations for each variable.
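A minimal sketch of this kind of wrangling with `dplyr`, using a toy stand-in data frame (the item names D03, SU02, ST01, etc. are hypothetical here; the real survey has 60 columns):

```r
library(dplyr)

# Toy stand-in for the raw survey data; the item names are hypothetical
raw <- data.frame(
  gender = c("F", "M", "F"),
  Age = c(21, 25, 30),
  Attitude = c(30, 35, 28),
  D03 = c(4, 3, 5), D11 = c(3, 4, 4),    # stand-ins for deep items
  SU02 = c(2, 3, 2), SU10 = c(3, 2, 3),  # stand-ins for surface items
  ST01 = c(4, 4, 3), ST09 = c(3, 5, 4),  # stand-ins for strategic items
  Points = c(22, 25, 0)
)

lrn <- raw %>%
  mutate(deep = rowMeans(across(c(D03, D11))),
         surf = rowMeans(across(c(SU02, SU10))),
         stra = rowMeans(across(c(ST01, ST09)))) %>%
  rename(age = Age, attitude = Attitude, points = Points) %>%
  select(gender, age, attitude, deep, stra, surf, points) %>%
  filter(points > 0)   # drop students with zero exam points

write.csv(lrn, "lrn2014.csv", row.names = FALSE)
```

Averaging each item group with `rowMeans(across(...))` and filtering out zero-point rows is what reduces the 183 raw observations to the 166 kept ones.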
Gender distribution was uneven, there were 110 female and 56 male subjects.
Distributions of each variable (except gender) are given below:
Following is a summary table for all variables:
| Variable | Mean | Median | Min | Max |
|---|---|---|---|---|
| Age | 25.51 | 22.00 | 17 | 55 |
| Attitude | 31.43 | 32 | 14 | 50 |
| Deep | 3.68 | 3.67 | 1.58 | 4.91 |
| Strategy | 3.12 | 3.18 | 1.25 | 5.00 |
| Surface | 2.78 | 2.83 | 1.58 | 4.33 |
| Points | 22.72 | 23.00 | 7.00 | 33.00 |
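A table like the one above can be generated directly from the data. A sketch (in the actual analysis this would be applied to the cleaned lrn2014 data; here a tiny stand-in data frame is used):

```r
# Build a mean/median/min/max table for the numeric columns of a data frame
num_summary <- function(df) {
  num <- df[sapply(df, is.numeric)]
  data.frame(
    Variable = names(num),
    Mean   = round(sapply(num, mean), 2),
    Median = round(sapply(num, median), 2),
    Min    = sapply(num, min),
    Max    = sapply(num, max),
    row.names = NULL
  )
}

# Tiny stand-in example:
toy <- data.frame(gender = c("F", "M"), age = c(20, 30), points = c(20, 26))
num_summary(toy)
```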
I first carried out an exploratory analysis to find out which variables correlate best with points:
Attitude, strategy and surface had the strongest correlations with points and were therefore chosen as the predictor variables.
I ran a multiple linear regression model and obtained the following results:
According to the results, attitude is the only variable that has a significant linear relationship with points (p < 0.001). Each one-point increase in attitude is associated with an increase of 0.34 in points.

The non-significant variables were removed and the model was fitted again.

According to the final model, attitude still has a significant linear relationship with points; each one-point increase in attitude is associated with a 0.35-point increase in points. The multiple R-squared is 0.19, which means that attitude explains 19 percent of the variance in points.
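The final model is a plain `lm()` call. A self-contained sketch on simulated data (the real call was `lm(points ~ attitude, data = lrn2014)` on the 166 students):

```r
set.seed(1)
# Simulated stand-in for lrn2014 (166 students), built so the
# true slope roughly matches the 0.35 reported above
lrn_sim <- data.frame(attitude = round(runif(166, 14, 50)))
lrn_sim$points <- 12 + 0.35 * lrn_sim$attitude + rnorm(166, sd = 4)

fit <- lm(points ~ attitude, data = lrn_sim)
coef(fit)["attitude"]    # slope; 0.35 in the real data
summary(fit)$r.squared   # 0.19 in the real data
```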
Assumptions of the multiple linear model are that:

- Errors are normally distributed
- Errors are not correlated
- Errors have constant variance
- The size of a given error does not depend on the explanatory variables
Diagnostic plots were created to check these assumptions
The Q-Q plot demonstrates that the errors of the model are approximately normally distributed; there are, however, values at the extremities that deviate from the normal line.

The scatterplot of residuals vs fitted values shows no pattern, i.e. the spread remains largely similar as the fitted values increase. However, some residuals are far from the fitted line; these could potentially be outliers.

The residuals vs leverage plot shows that none of the observations has an unusually high impact on the results.
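These three diagnostics come straight from base R's `plot` method for `lm` objects. A sketch using a built-in example model so the snippet runs standalone (in the actual analysis the fitted object was the points ~ attitude model):

```r
# Fit a small example model from the built-in cars dataset
fit <- lm(dist ~ speed, data = cars)

# which = 1: Residuals vs Fitted, 2: Normal Q-Q, 5: Residuals vs Leverage
par(mfrow = c(1, 3))
plot(fit, which = c(1, 2, 5))
```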
The current dataset is called "Student Performance Data Set" and can be downloaded from https://archive.ics.uci.edu/ml/datasets/Student+Performance. The data compares student performance in two subjects: Portuguese and mathematics. G1 refers to grades received during the first period, and similarly G2 and G3 correspond to grades received in the second and third periods.
library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(boot)
After data wrangling, the dataset was saved as alc.csv; it contains 35 variables and 382 observations.
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
## [1] 382 35
I would be interested in examining the relationship between alcohol use and:
Parental cohabitation: I think that tense family relationships can lead to higher stress in teenagers and therefore a tendency towards greater alcohol consumption.

Mother's education: Research shows that higher education for women results in multiple positive outcomes for family and society. This could also be reflected in the attitudes of their children.

Romantic relationships: Failure to bond during the teenage years and peer pressure to be in a relationship can result in a higher tendency to drink.

Current health status: It might be a causal factor for drinking, but also vice versa, i.e. excessive drinking could be related to poor health.
Parental cohabitation and romantic relationship are binomial variables. Mother's education and health are categorical, ordinal variables. Bar charts showing their distributions are presented below.

The charts show the distribution of subjects based on their relationship status, parental cohabitation, and usage of alcohol.
## # A tibble: 4 x 3
## # Groups: Pstatus [2]
## Pstatus high_use count
## <fct> <lgl> <int>
## 1 A FALSE 26
## 2 A TRUE 12
## 3 T FALSE 242
## 4 T TRUE 102
## # A tibble: 4 x 3
## # Groups: romantic [2]
## romantic high_use count
## <fct> <lgl> <int>
## 1 no FALSE 180
## 2 no TRUE 81
## 3 yes FALSE 88
## 4 yes TRUE 33
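Cross-tabulations like the ones above can be produced with `dplyr`'s grouped counts. A sketch on a toy stand-in data frame (the real data frame is `alc`):

```r
library(dplyr)

# Toy stand-in for alc with the two columns used above
alc_toy <- data.frame(
  romantic = c("no", "no", "yes", "yes", "no"),
  high_use = c(FALSE, TRUE, FALSE, FALSE, FALSE)
)

counts <- alc_toy %>%
  group_by(romantic, high_use) %>%
  summarise(count = n(), .groups = "drop")
counts
```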
None of the chosen variables is significantly associated with high alcohol use. The model summary is presented below, followed by the odds ratios for high alcohol use and their confidence intervals.
##
## Call:
## glm(formula = high_use ~ Pstatus + Medu + romantic + health,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4627 -0.8821 -0.7091 1.2911 2.0228
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.10674 1.32994 0.080 0.936
## PstatusT -0.08875 0.38587 -0.230 0.818
## Medu1 -0.91856 1.27172 -0.722 0.470
## Medu2 -1.90163 1.26062 -1.508 0.131
## Medu3 -0.91270 1.25197 -0.729 0.466
## Medu4 -1.27811 1.24739 -1.025 0.306
## romanticyes -0.15946 0.25057 -0.636 0.525
## health2 0.76463 0.48160 1.588 0.112
## health3 0.13564 0.44268 0.306 0.759
## health4 0.32219 0.45049 0.715 0.474
## health5 0.63142 0.39548 1.597 0.110
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 448.12 on 371 degrees of freedom
## AIC: 470.12
##
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.1126471 0.086924876 27.053858
## PstatusT 0.9150754 0.437244590 2.007650
## Medu1 0.3990923 0.017558764 4.548870
## Medu2 0.1493248 0.006654147 1.665484
## Medu3 0.4014386 0.018078594 4.410220
## Medu4 0.2785625 0.012611116 3.033450
## romanticyes 0.8526065 0.517603874 1.385456
## health2 2.1482059 0.843454699 5.637405
## health3 1.1452705 0.486362057 2.792529
## health4 1.3801401 0.576722981 3.413775
## health5 1.8802695 0.888022503 4.233663
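Odds ratios and profile-likelihood confidence intervals like those above are obtained by exponentiating the coefficients and their `confint()` bounds. A sketch on simulated data (the real call fitted `high_use ~ Pstatus + Medu + romantic + health` on `alc`):

```r
library(MASS)  # provides profile-likelihood confint() for glm in older R versions
set.seed(42)

# Simulated binary outcome with one predictor, standing in for the alc data
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, 1, plogis(-1 + 0.8 * toy$x))

m <- glm(y ~ x, data = toy, family = "binomial")

# Exponentiating coefficients and their profile CIs gives odds ratios;
# confint() prints the "Waiting for profiling to be done..." message seen above
OR <- exp(cbind(OR = coef(m), confint(m)))
OR
```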
I chose study time, activities and going out with friends. Distributions of these variables are presented through bar charts.
Based on the distributions I chose to explore gender, study time, activities and going out. Gender and going out were significantly associated with high alcohol use. Study time also had a significant but weak association.
Males were more likely to drink heavily (OR 2.24, 95% CI 1.32-3.84). Compared to the reference group, students who went out most frequently were more likely to consume alcohol heavily (OR 10.74, 95% CI 3.04-51.42).
##
## Call:
## glm(formula = high_use ~ sex + studytime + activities + goout,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7617 -0.7725 -0.5228 0.8228 2.5617
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8672 0.6722 -2.778 0.005474 **
## sexM 0.8073 0.2721 2.966 0.003012 **
## studytime2 -0.3456 0.2970 -1.164 0.244524
## studytime3 -0.9712 0.4809 -2.019 0.043455 *
## studytime4 -1.1241 0.6298 -1.785 0.074310 .
## activitiesyes -0.4044 0.2590 -1.561 0.118454
## goout2 0.3503 0.6956 0.504 0.614510
## goout3 0.5243 0.6813 0.770 0.441507
## goout4 2.0665 0.6816 3.032 0.002430 **
## goout5 2.3735 0.7020 3.381 0.000722 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 384.35 on 372 degrees of freedom
## AIC: 404.35
##
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.1545556 0.03368531 0.5095379
## sexM 2.2418487 1.32119716 3.8496160
## studytime2 0.7077842 0.39492190 1.2688956
## studytime3 0.3786406 0.14036906 0.9407166
## studytime4 0.3249548 0.08308952 1.0313723
## activitiesyes 0.6673741 0.39987596 1.1063789
## goout2 1.4195109 0.40277427 6.7023293
## goout3 1.6893594 0.49743864 7.8225116
## goout4 7.8972057 2.33914417 36.6844868
## goout5 10.7354006 3.03673066 51.4240176
Previously significant variables (sex, study time, going out) were chosen for this model. The cross-tabulation below shows the proportions of subjects that the model classified correctly and incorrectly as high users or non-high users.
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65183246 0.04973822 0.70157068
## TRUE 0.15445026 0.14397906 0.29842932
## Sum 0.80628272 0.19371728 1.00000000
The loss function of the training model was:
## [1] 0.2041885
If a simple guessing strategy were employed, there would be a 50 percent chance of a correct answer, since the outcome is binomial. The model, however, is more effective: it gives a correct result about 80 percent of the time, as demonstrated on the training set.
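The loss function here is the proportion of misclassified observations at a 0.5 threshold. A sketch of how it is typically defined in this course material:

```r
# Proportion of predictions on the wrong side of the 0.5 threshold
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Toy example: one of four predictions is on the wrong side of 0.5
loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.2, 0.4, 0.1))  # → 0.25
```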
## [1] 0.2303665
The 10-fold cross-validation resulted in an error of 0.23 for my model, which is less than the 0.26 error of the model introduced in DataCamp.
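The 10-fold cross-validation can be run with `boot::cv.glm`, passing the loss function as the cost. A sketch on simulated data (the real call cross-validated the high_use model on `alc`):

```r
library(boot)
set.seed(123)

# Simulated stand-in for the alc data
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, 1, plogis(-1 + 0.8 * toy$x))

loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

m <- glm(y ~ x, data = toy, family = "binomial")
cv <- cv.glm(data = toy, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]   # average proportion of wrong predictions across the 10 folds
```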
Let us fit all the variables into a logistic regression model.
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
## [36] "probability" "prediction"
##
## Call:
## glm(formula = high_use ~ school + sex + studytime + goout + Pstatus +
## Medu + Fedu + Mjob + Fjob + reason + nursery + internet +
## guardian + traveltime + studytime + failures + schoolsup +
## famsup + paid + activities + higher + romantic + famrel +
## freetime + goout + health + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9372 -0.6024 -0.2891 0.3696 2.9525
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.00788 998.11272 -0.016 0.987204
## schoolMS -0.31881 0.56801 -0.561 0.574610
## sexM 1.14045 0.36391 3.134 0.001725 **
## studytime2 -0.10816 0.39602 -0.273 0.784764
## studytime3 -0.58458 0.62948 -0.929 0.353053
## studytime4 -0.51937 0.81835 -0.635 0.525648
## goout2 0.35067 0.92424 0.379 0.704381
## goout3 0.39244 0.89417 0.439 0.660744
## goout4 2.48140 0.90743 2.735 0.006247 **
## goout5 2.57466 0.94753 2.717 0.006583 **
## PstatusT -0.64279 0.57431 -1.119 0.263036
## Medu1 -1.46770 1.63389 -0.898 0.369033
## Medu2 -2.29763 1.63061 -1.409 0.158818
## Medu3 -0.97329 1.65010 -0.590 0.555302
## Medu4 -1.11258 1.71722 -0.648 0.517053
## Fedu1 13.12642 998.10935 0.013 0.989507
## Fedu2 13.35623 998.10933 0.013 0.989323
## Fedu3 13.14915 998.10933 0.013 0.989489
## Fedu4 13.46401 998.10938 0.013 0.989237
## Mjobhealth -1.49344 0.86167 -1.733 0.083061 .
## Mjobother -0.81758 0.57177 -1.430 0.152744
## Mjobservices -1.17488 0.64412 -1.824 0.068150 .
## Mjobteacher -0.54941 0.80208 -0.685 0.493354
## Fjobhealth 0.51742 1.31546 0.393 0.694071
## Fjobother 1.19438 0.99984 1.195 0.232255
## Fjobservices 1.97809 1.02534 1.929 0.053705 .
## Fjobteacher -0.58137 1.19604 -0.486 0.626915
## reasonhome 0.58678 0.43925 1.336 0.181590
## reasonother 1.59275 0.60414 2.636 0.008379 **
## reasonreputation 0.12521 0.45818 0.273 0.784633
## nurseryyes -0.38504 0.41007 -0.939 0.347750
## internetyes -0.18085 0.47879 -0.378 0.705634
## guardianmother -0.65645 0.39486 -1.662 0.096413 .
## guardianother -0.70067 0.94726 -0.740 0.459495
## traveltime2 -0.30249 0.38375 -0.788 0.430558
## traveltime3 1.91997 0.80682 2.380 0.017328 *
## traveltime4 3.26053 1.12585 2.896 0.003779 **
## failures 0.17111 0.29453 0.581 0.561252
## schoolsupyes -0.35779 0.50971 -0.702 0.482709
## famsupyes -0.08541 0.35824 -0.238 0.811558
## paidyes 0.67650 0.35640 1.898 0.057673 .
## activitiesyes -0.35070 0.34488 -1.017 0.309210
## higheryes -0.11162 0.78940 -0.141 0.887556
## romanticyes -0.45218 0.36731 -1.231 0.218300
## famrel2 0.67295 1.40011 0.481 0.630773
## famrel3 0.63499 1.27185 0.499 0.617593
## famrel4 -0.54507 1.24670 -0.437 0.661956
## famrel5 -1.36245 1.26822 -1.074 0.282688
## freetime2 1.02423 1.12818 0.908 0.363952
## freetime3 1.57764 1.09818 1.437 0.150831
## freetime4 2.01108 1.12377 1.790 0.073521 .
## freetime5 1.50294 1.19980 1.253 0.210329
## health2 1.09810 0.72431 1.516 0.129504
## health3 0.50881 0.68053 0.748 0.454660
## health4 0.74525 0.68633 1.086 0.277547
## health5 0.97885 0.61138 1.601 0.109368
## absences 0.10369 0.03037 3.415 0.000639 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 292.37 on 325 degrees of freedom
## AIC: 406.37
##
## Number of Fisher Scoring iterations: 14
## [1] 0.1910995
## [1] 0.2565445
The training error of this model was 0.19 and the testing error was 0.26.
The significant variables were selected and another model was created.
##
## Call:
## glm(formula = high_use ~ sex + goout + reason + traveltime +
## absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7762 -0.7111 -0.4500 0.6284 2.5265
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.31613 0.76071 -4.359 1.31e-05 ***
## sexM 0.97063 0.27099 3.582 0.000341 ***
## goout2 0.49127 0.74900 0.656 0.511890
## goout3 0.54824 0.73191 0.749 0.453824
## goout4 2.23550 0.73810 3.029 0.002456 **
## goout5 2.35217 0.75767 3.104 0.001906 **
## reasonhome 0.36755 0.33529 1.096 0.272983
## reasonother 1.15017 0.46511 2.473 0.013403 *
## reasonreputation -0.16018 0.35498 -0.451 0.651819
## traveltime2 -0.11500 0.30300 -0.380 0.704275
## traveltime3 1.47622 0.57538 2.566 0.010298 *
## traveltime4 1.87986 0.92470 2.033 0.042057 *
## absences 0.08835 0.02377 3.717 0.000202 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 360.59 on 369 degrees of freedom
## AIC: 386.59
##
## Number of Fisher Scoring iterations: 4
## [1] 0.1884817
## [1] 0.2015707
This model performed better on the testing set, i.e. the testing error was 0.20. Finally, I decided to remove the reason (for choosing the school) and travel time variables, because they were weak predictors and the "other" category of the reason variable was hard to interpret.
##
## Call:
## glm(formula = high_use ~ sex + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7601 -0.7278 -0.4717 0.7663 2.2870
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.89554 0.71197 -4.067 4.76e-05 ***
## sexM 1.02291 0.26103 3.919 8.90e-05 ***
## goout2 0.35631 0.73254 0.486 0.626685
## goout3 0.42736 0.71684 0.596 0.551053
## goout4 1.99158 0.71679 2.778 0.005461 **
## goout5 2.25925 0.73593 3.070 0.002141 **
## absences 0.08396 0.02278 3.685 0.000229 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 380.07 on 375 degrees of freedom
## AIC: 394.07
##
## Number of Fisher Scoring iterations: 4
## [1] 0.2068063
## [1] 0.2068063
The final model had 3 predictors and a testing-set loss of 0.21.
## n_variables training_error
## 1 27 0.1910995
## 2 5 0.1884817
## 3 3 0.2068063
## n_variables testing_error
## 1 27 0.2565445
## 2 5 0.2015707
## 3 3 0.2068063
library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(boot)
library(MASS)
library(corrplot)
library(tidyverse)
library(plotly)
Now we shall load the Boston dataset from the MASS package and explore it.
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
dim(Boston)
## [1] 506 14
The data has 14 variables and 506 observations; each observation corresponds to a suburb of Boston, i.e. 506 suburbs in total. The variables are described in detail at https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
Data contains 12 numeric variables and 2 integer variables.
The variables are labeled as follows:

varlabels <- c("per capita crime rate", "proportion of residential land zoned", "proportion of industrial zones per town", "towns bordering Charles River N/Y", "NO concentration, PP10M", "average number of rooms per dwelling", "proportion of units built pre-1940", "weighted mean of distances to 5 Boston employment centres", "index of accessibility to radial highways", "property tax rate per $10,000", "pupil to teacher ratio", "proportion of blacks by town", "lower status of population", "median value of homes in $1000s")
selcols <- colnames(Boston)
selvars <- dplyr::select(Boston, one_of(selcols))
a <- 0
for (var in selvars) {
  a <- a + 1
  if (is.integer(var)) {
    # bar chart for integer-valued variables
    plot <- ggplot(Boston, aes(var, ..count..)) + geom_bar() +
      xlab(varlabels[a]) + ylab("Number of suburbs")
  } else {
    # histogram for continuous variables, Scott-style bin width
    plot <- qplot(var,
                  geom = "histogram",
                  binwidth = ((1/7.9) * sd(var)) * 3.49,
                  main = paste("Histogram for", varlabels[a]),
                  xlab = varlabels[a],
                  fill = I("blue"),
                  col = I("red"),
                  alpha = I(.2))
  }
  print(plot)
}
cor_matrix<-cor(Boston)
corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The correlation matrix shows that the strongest positive correlations are between 1. accessibility to highways and property tax rate, 2. industrialization and NO concentration, and 3. number of rooms per dwelling and median value of homes. The strongest negative correlations are between 1. proportion of buildings built prior to 1940 and distance to employment centres, 2. NO concentration and proportion of buildings built prior to 1940, 3. proportion of industrial zones and distance to employment centres, and 4. percentage of lower-status population and median value of owner-occupied homes.
Already many patterns can be drawn from this data. Accessibility to highways implies higher development and could mean higher property tax rates. More industrial zones can explain higher NO concentrations. The number of rooms per dwelling could explain the median value of homes, as larger homes are more expensive. The negative relationship between the percentage of lower-status population and the median value of homes shows that suburbs with more expensive homes have wealthier residents.
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
label <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = label)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
After standardizing the data, the mean of every variable has been centered at zero and the ranges have decreased, while similar proportions between the values are maintained.
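What `scale()` does column by column is subtract the mean and divide by the standard deviation. A minimal check on a single vector:

```r
x <- c(2, 4, 6, 8)
z <- as.vector(scale(x))             # (x - mean(x)) / sd(x)

round(mean(z), 10)                   # → 0 after centering
sd(z)                                # → 1 after scaling
all.equal(z, (x - mean(x)) / sd(x))  # → TRUE
```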
Now we shall split the data into train and test set.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
dim(train)
## [1] 404 14
dim(test)
## [1] 102 14
Drawing the LDA biplot. The categorical crime rate variable was used as the target, and all other variables were used as predictors.
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 3)
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class) %>% addmargins()
## predicted
## correct low med_low med_high high Sum
## low 17 6 1 0 24
## med_low 4 14 4 0 22
## med_high 0 9 20 0 29
## high 0 0 0 27 27
## Sum 21 29 25 27 102
When the trained model was tested on the test dataset, it predicted correctly 17 out of 24 low-crime, 14 out of 22 med_low, 20 out of 29 med_high, and 27 out of 27 high-crime suburbs.
data('Boston')
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
dist_eu <- dist(Boston, method = "euclidean")
dist_man <- dist(Boston, method = 'manhattan')
km <-kmeans(boston_scaled, centers = 3)
pairs(boston_scaled[6:10], col = km$cluster)
Checking the optimal number of clusters and re-analyzing.
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
km <-kmeans(boston_scaled, centers = 4)
pairs(boston_scaled[6:10], col = km$cluster)
I could not find a clear drop in the total WCSS after two clusters, so I initially set the number of clusters at three.
If we draw a pairs plot, we can see that we cannot separate the neighborhoods based on two variables. Some variables are good at separating 2 clusters, but almost no combination of two variables can separate three clusters in a good way.
So why don't we try separating the clusters based on LDA instead of pairs plots? I chose 4 clusters for this purpose because it gives a good separation of the data.
data('Boston')
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
set.seed(123)
km <- kmeans(boston_scaled, centers = 4)
lda.fit <- lda(km$cluster~ ., data = boston_scaled)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(km$cluster)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 7)
According to the results from the LDA plot, the proportion of black residents, nitric oxide concentration, industrialization, tax rate and crime rate were the biggest factors separating the suburbs into four categories.

A higher crime rate grouped the suburbs in cluster one. A lower (or higher? this is unclear from the variable description and formula) proportion of black residents pulled towns in the opposite direction. NO concentration, industrialization and tax rate pulled suburbs towards the left, i.e. clusters 1 and 2. It can also be seen that these zones were more accessible to radial highways.
We predicted the crime rate category using the other variables and plotted the result in 3D!

The points were colored in the first graph according to the crime rate category, and in the second graph according to k-means clustering (the data was clustered into 4 groups to match the four crime rate categories).

Even though the clustering in the second 3D graph used all available variables and was not aimed specifically at predicting crime, it still does a pretty good job of separating the suburbs into four groups according to crime rate.
#PLOTTING 3D GRAPH FOR TRAIN SET WITH CRIME
data('Boston')
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
km <- kmeans(train, centers = 4)
bins <- quantile(train$crim)
label <- c("low", "med_low", "med_high", "high")
crime <- cut(train$crim, breaks = bins, include.lowest = TRUE, label = label)
train <- data.frame(train, crime)
lda.fit <- lda(crime ~ ., data = train)
classes <- as.numeric(train$crime)
model_predictors <- dplyr::select(train, -crime)
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)
Loading the required libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(boot)
library(MASS)
library(corrplot)
library(tidyverse)
library(plotly)
library(stringr)
library(FactoMineR)
Loading human.csv data and carrying out the required data wrangling.
## 'data.frame': 195 obs. of 19 variables:
## $ hdirank : int 1 2 3 4 5 6 6 8 9 9 ...
## $ country : Factor w/ 195 levels "Afghanistan",..: 129 10 169 48 124 67 84 186 34 125 ...
## $ hdi : num 0.944 0.935 0.93 0.923 0.922 0.916 0.916 0.915 0.913 0.913 ...
## $ lifeexp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ eduexp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ meanedu : num 12.6 13 12.8 12.7 11.9 13.1 12.2 12.9 13 12.5 ...
## $ gni : Factor w/ 194 levels "1,096","1,123",..: 166 135 156 139 140 137 127 154 134 117 ...
## $ gniminusrank: int 5 17 6 11 9 11 16 3 11 23 ...
## $ giirank : int 1 2 3 4 5 6 6 8 9 9 ...
## $ gii : num 0.067 0.11 0.028 0.048 0.062 0.041 0.113 0.28 0.129 0.157 ...
## $ matmor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ adobir : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ fparli : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## $ secedf : num 97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
## $ secedm : num 96.7 94.6 96.6 96.6 90.5 97 78.6 94.8 100 95.3 ...
## $ labf : num 61.2 58.8 61.8 58.7 58.5 53.6 53.1 56.3 61.6 62 ...
## $ labm : num 68.7 71.8 74.9 66.4 70.6 66.4 68.1 68.9 71 73.8 ...
## $ secedfm : num 1.007 0.997 0.983 0.989 0.969 ...
## $ labfm : num 0.891 0.819 0.825 0.884 0.829 ...
## [1] 195 19
The variables have been derived from two datasets. The first is the data used to create the Human Development Index (HDI), developed by the UNDP. The Gender Inequality Index (GII) is also developed by the UNDP. According to the UNDP website, it measures gender inequalities in three important aspects of human development: reproductive health, measured by the maternal mortality ratio and adolescent birth rate; empowerment, measured by the proportion of parliamentary seats occupied by women; and economic status, expressed as labour market participation. The variables we will be using from these datasets are:
country: the country being described
hdirank: the country's rank on the Human Development Index
hdi: Human Development Index score
lifeexp: life expectancy at birth
eduexp: expected years of schooling
meanedu: mean years of schooling
gni: gross national income (GNI) per capita
gniminusrank: GNI per capita rank minus HDI rank
giirank: Gender Inequality Index rank
gii: Gender Inequality Index score
matmor: maternal mortality ratio
adobir: adolescent birth rate
fparli: percentage of parliamentary seats held by women
secedf: percentage of women with secondary education
secedm: percentage of men with secondary education
labf: labor force participation rate of women
labm: labor force participation rate of men
Using the above variables, the following additional variables were created:
secedfm: ratio of the female to male percentages with secondary education
labfm: ratio of the female to male labor force participation rates
Now let's make the changes instructed in the data wrangling part of exercise 5.
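The wrangling steps can be sketched roughly as follows. This is a sketch only, assuming the column names described above; the full script lives in the data folder, and the exact order of steps there may differ.

```r
library(dplyr)
library(stringr)

# gni was read in as a factor with thousand separators, e.g. "64,992"
human$gni <- as.numeric(str_replace(as.character(human$gni), ",", ""))

# keep only the variables needed for the analysis, plus country
keep <- c("country", "secedfm", "labfm", "eduexp", "lifeexp", "gni",
          "matmor", "adobir", "fparli")
human <- select(human, one_of(keep))

# flag rows with no missing values, then keep only those
human$comp <- complete.cases(human)
human <- filter(human, comp == TRUE)

# the last rows of the data are regions rather than countries; drop them,
# then use the country names as row names
human <- human[1:(nrow(human) - 7), ]
rownames(human) <- human$country
human <- select(human, -country, -comp)
```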
## secedfm labfm eduexp lifeexp gni matmor adobir fparli comp
## 1 1.0072389 0.8908297 17.5 81.6 64992 4 7.8 39.6 TRUE
## 2 0.9968288 0.8189415 20.2 82.4 42261 6 12.1 30.5 TRUE
## 3 0.9834369 0.8251001 15.8 83.0 56431 6 1.9 28.5 TRUE
## 4 0.9886128 0.8840361 18.7 80.2 44025 5 5.1 38.0 TRUE
## 5 0.9690608 0.8286119 17.9 81.6 45435 6 6.2 36.9 TRUE
## 6 0.9927835 0.8072289 16.5 80.9 43919 7 3.8 36.9 TRUE
## 7 1.0241730 0.7797357 18.6 80.9 39568 9 8.2 19.9 TRUE
## 8 1.0031646 0.8171263 16.5 79.1 52947 28 31.0 19.4 TRUE
## 9 1.0000000 0.8676056 15.9 82.0 42155 11 14.5 28.2 TRUE
## 10 0.9968520 0.8401084 19.2 81.8 32689 8 25.3 31.4 TRUE
## 11 0.9148148 0.7616580 15.4 83.0 76628 6 6.0 25.3 TRUE
## 12 0.9116162 0.7566372 15.6 84.0 53959 NA 3.3 NA FALSE
## 13 NA NA 15.0 80.0 79851 NA NA 20.0 FALSE
## 14 0.9908362 0.8880707 15.8 82.2 45636 4 6.5 43.6 TRUE
## 15 0.9989990 0.8107715 16.2 80.7 39267 8 25.8 23.5 TRUE
## 16 0.9934498 0.9108527 19.0 82.6 35182 4 11.5 41.3 TRUE
## 17 0.8641975 0.6948682 16.9 81.9 33890 27 2.2 16.3 TRUE
## 18 0.9667812 0.8379161 16.0 82.4 30676 2 7.8 22.5 TRUE
## 19 1.0000000 0.7848297 13.9 81.7 58711 11 8.3 28.3 TRUE
## 20 1.0139860 0.6931818 15.3 83.5 36927 6 5.4 11.6 TRUE
## 21 0.9348613 0.8010118 16.3 80.8 41187 6 6.7 42.4 TRUE
## 22 0.9375000 0.8230519 16.0 82.2 38056 12 5.7 25.7 TRUE
## 23 1.0000000 0.8064993 15.7 81.4 43869 4 4.1 30.3 TRUE
## 24 1.0000000 0.8703125 17.1 80.8 38695 4 9.2 42.5 TRUE
## 25 0.9775510 0.8275316 16.8 80.4 27852 7 0.6 27.7 TRUE
## 26 0.9138167 0.7978723 17.3 82.6 32045 4 10.6 38.0 TRUE
## 27 0.8844720 0.6655462 16.0 83.1 33030 4 4.0 30.1 TRUE
## 28 1.0020060 0.7481698 16.4 78.6 26660 5 4.9 18.9 TRUE
## 29 0.8880597 0.7072000 17.6 80.9 24524 5 11.9 21.0 TRUE
## 30 1.0000000 0.8156749 16.5 76.8 25214 11 16.8 19.8 TRUE
## 31 0.9424779 0.6985392 14.5 78.8 72570 27 23.0 NA FALSE
## 32 0.9302326 0.7876231 14.0 80.2 28633 10 5.5 12.5 TRUE
## 33 1.1305085 0.5319372 13.8 78.2 123124 6 9.5 0.0 TRUE
## 34 1.0040568 NA 13.5 81.3 43978 NA NA 50.0 FALSE
## 35 0.9959799 0.7448980 15.1 76.3 25845 7 15.9 18.7 TRUE
## 36 0.9286550 0.7534669 15.5 77.4 23177 3 12.2 22.1 TRUE
## 37 0.9448568 0.8291233 16.4 73.3 24500 11 10.6 23.4 TRUE
## 38 0.8772379 0.5716440 14.4 80.6 27930 9 18.2 13.0 TRUE
## 39 0.8605974 0.2579821 16.3 74.3 52821 16 10.2 19.9 TRUE
## 40 0.9774306 0.6333333 17.9 76.3 22050 69 54.4 36.8 TRUE
## 41 1.1944444 0.5054348 13.3 77.0 60868 8 27.6 17.5 TRUE
## 42 0.9594241 0.6577540 15.2 81.7 21290 22 55.3 15.8 TRUE
## 43 0.9896266 0.8293051 16.3 80.9 25757 8 12.6 31.3 TRUE
## 44 0.9918946 0.7466667 15.4 75.2 22916 14 12.1 10.1 TRUE
## 45 1.1031128 0.4510932 14.4 76.6 38599 22 13.8 15.0 TRUE
## 46 0.9989899 0.8121302 15.2 74.2 22281 13 13.5 18.0 TRUE
## 47 0.9081197 0.7654110 14.8 77.3 19409 13 12.7 25.8 TRUE
## 48 0.9875666 0.5246691 14.7 74.4 83961 14 14.5 1.5 TRUE
## 49 0.8891235 0.7504363 15.2 76.2 14558 7 15.2 17.3 TRUE
## 50 0.9436009 0.7939778 15.7 71.3 16676 1 20.6 30.1 TRUE
## 51 0.9686486 0.7963738 14.7 70.1 22352 24 25.7 14.5 TRUE
## 52 0.8266200 0.3510896 13.6 76.8 34858 11 10.6 9.6 TRUE
## 53 0.9358696 0.7503852 14.2 74.7 18108 33 31.0 12.0 TRUE
## 54 1.0815109 0.7239583 15.5 77.2 19283 14 58.3 11.5 TRUE
## 55 1.0410959 0.8738966 12.6 75.4 21336 37 28.5 16.7 TRUE
## 56 0.9645749 0.8690629 15.0 69.4 20867 26 29.9 20.1 TRUE
## 57 1.0205245 0.8603133 15.4 75.6 12488 52 48.4 19.6 TRUE
## 58 NA NA 14.0 76.1 20070 NA 49.3 25.7 FALSE
## 59 0.9717868 0.8118644 14.4 74.2 15596 5 35.9 20.4 TRUE
## 60 NA NA 13.7 72.7 13496 NA NA 10.3 FALSE
## 61 1.0821643 0.5990220 13.3 77.6 18192 85 78.5 19.3 TRUE
## 62 0.9130435 0.5880795 12.7 74.7 22762 29 5.7 14.2 TRUE
## 63 0.8517241 0.5876011 15.6 74.4 17470 73 30.9 11.6 TRUE
## 64 1.0045045 NA 13.4 73.1 23300 NA 56.3 43.8 FALSE
## 65 0.9802956 0.7019868 12.3 70.4 26090 84 34.8 24.7 TRUE
## 66 0.7934783 0.7307061 14.4 74.9 12190 16 16.9 34.0 TRUE
## 67 0.9428934 0.6200000 13.8 79.4 7301 80 43.1 48.9 TRUE
## 68 0.9566787 0.3286319 13.8 79.3 16509 16 12.0 3.1 TRUE
## 69 1.0039604 0.5898734 13.9 79.4 13413 38 60.8 33.3 TRUE
## 70 0.9201183 0.2255435 15.1 75.4 15440 23 31.6 3.1 TRUE
## 71 1.1141732 0.6452020 14.2 74.2 16159 110 83.2 17.0 TRUE
## 72 0.6500000 0.4152542 14.5 75.3 18677 20 30.9 14.4 TRUE
## 73 0.9515707 0.4600262 13.7 74.9 9779 29 16.9 5.8 TRUE
## 74 0.9191419 0.5644556 13.1 76.8 16056 49 63.4 37.1 TRUE
## 75 1.0419847 0.7351485 15.2 74.5 15175 69 70.8 9.6 TRUE
## 76 0.9676375 0.7523302 13.8 74.9 7164 41 46.8 11.3 TRUE
## 77 NA NA 12.9 73.8 20805 NA NA 6.7 FALSE
## 78 0.9620123 0.9037356 11.9 70.8 16428 26 40.0 15.6 TRUE
## 79 NA NA 15.8 73.4 10939 23 35.4 25.0 FALSE
## 80 0.8853503 0.2342342 13.5 74.0 11365 50 26.5 11.6 TRUE
## 81 0.7230216 0.6385185 13.4 75.4 11780 7 18.3 33.3 TRUE
## 82 0.9562044 0.7952167 15.1 71.0 8178 23 25.7 11.8 TRUE
## 83 0.8612903 0.2105263 14.0 74.8 13054 89 10.0 25.7 TRUE
## 84 0.8517398 0.8080569 13.1 74.6 11015 89 50.7 22.3 TRUE
## 85 0.9306030 0.6854962 11.8 77.8 9943 21 15.3 20.7 TRUE
## 86 0.9894737 0.7465565 12.3 74.7 8124 29 27.1 10.7 TRUE
## 87 0.6432665 0.5951134 13.6 76.5 9638 8 15.1 19.3 TRUE
## 88 1.0177665 0.6614268 14.2 75.9 10605 87 77.0 41.6 TRUE
## 89 NA 0.8228346 12.6 75.1 9765 34 56.3 20.7 FALSE
## 90 0.8164117 0.8160920 13.1 75.8 12547 32 8.6 23.6 TRUE
## 91 0.9953488 0.5208333 15.7 70.0 7493 59 42.8 14.0 TRUE
## 92 1.0142687 0.8167388 14.6 69.4 10729 68 18.7 14.9 TRUE
## 93 0.8750000 0.7967782 13.5 74.4 13323 26 41.0 6.1 TRUE
## 94 1.2801724 NA 12.7 77.8 9994 NA NA 21.9 FALSE
## 95 1.3245823 0.3926702 14.0 71.6 14911 15 2.5 16.0 TRUE
## 96 0.7114967 0.3540197 14.6 74.8 10404 46 4.6 31.3 TRUE
## 97 1.0233813 0.7001255 13.5 74.0 12040 83 68.5 20.9 TRUE
## 98 NA 0.7141026 13.4 72.9 9937 45 54.5 13.0 FALSE
## 99 1.0541311 0.7912553 12.4 75.7 7415 80 70.1 16.7 TRUE
## 100 0.9909400 0.7171582 14.7 72.8 5069 120 18.1 0.0 TRUE
## 101 1.0079156 0.5978129 13.6 70.0 7614 45 71.4 13.3 TRUE
## 102 1.0470810 0.6526718 13.1 73.5 11883 100 99.6 19.1 TRUE
## 103 0.9469214 0.5886628 12.7 71.1 15617 130 35.2 11.8 TRUE
## 104 0.8348624 0.7251613 13.0 76.8 12328 31 4.2 5.9 TRUE
## 105 1.0716667 0.4023973 12.9 73.4 5327 58 28.3 6.1 TRUE
## 106 0.9448010 0.8811275 12.5 64.5 16646 170 44.2 9.5 TRUE
## 107 0.9689441 0.8506787 11.9 71.6 5223 21 29.3 20.8 TRUE
## 108 0.7244224 0.3168449 13.5 71.1 10512 45 43.0 2.2 TRUE
## 109 NA 0.6098830 10.8 65.6 13066 61 18.0 25.8 FALSE
## 110 1.4930748 0.8593272 12.5 64.4 16367 240 103.0 16.2 TRUE
## 111 0.8109756 0.6104513 13.0 68.9 9788 190 48.3 17.1 TRUE
## 112 0.8558140 0.6568396 11.9 72.9 7643 110 67.0 16.8 TRUE
## 113 0.9074074 0.2319277 13.0 72.9 4699 NA 45.8 NA FALSE
## 114 NA 0.6362434 11.5 68.4 5567 36 38.8 16.4 FALSE
## 115 1.0345369 0.6411543 11.3 68.2 7915 120 46.8 27.1 TRUE
## 116 0.8440367 0.6050633 12.3 73.0 7349 69 76.0 27.4 TRUE
## 117 0.9578393 0.7355372 13.6 57.4 12122 140 50.9 40.7 TRUE
## 118 0.8342697 0.8880779 11.9 75.8 5092 49 29.0 24.3 TRUE
## 119 0.8054146 0.7935723 13.2 68.3 5760 200 71.9 51.8 TRUE
## 120 0.9762397 0.7044025 12.5 70.6 3044 75 29.3 23.3 TRUE
## 121 0.5537849 0.2134670 10.1 69.4 14003 67 68.7 26.5 TRUE
## 122 NA 0.6152927 13.5 73.3 6094 53 70.6 20.8 FALSE
## 123 NA NA 11.7 69.1 3432 96 18.6 0.0 FALSE
## 124 1.2615063 0.5291925 10.3 66.4 6522 250 88.5 31.3 TRUE
## 125 1.0287206 0.5902864 11.5 74.9 4457 100 100.8 39.1 TRUE
## 126 0.6854305 0.3496042 11.6 74.0 6850 120 35.8 11.0 TRUE
## 127 0.9680233 0.8587127 11.3 64.8 9418 130 54.9 37.7 TRUE
## 128 0.9439655 0.5589569 10.7 71.8 6929 140 97.2 13.3 TRUE
## 129 1.0427632 0.7639429 11.2 69.4 2517 44 42.8 15.2 TRUE
## 130 0.4770318 0.3379224 11.7 68.0 5497 190 32.8 12.2 TRUE
## 131 1.0852713 0.5162847 11.1 73.1 3938 120 84.0 25.8 TRUE
## 132 0.9855072 0.8639896 12.6 69.5 7176 120 40.9 8.3 TRUE
## 133 NA 0.4842520 11.7 68.2 5363 270 52.2 38.5 FALSE
## 134 0.7283951 0.1856946 12.3 69.6 2728 49 41.6 12.4 TRUE
## 135 NA 0.7687500 10.6 71.9 2803 86 44.8 0.0 FALSE
## 136 0.8446809 0.9383562 11.1 62.3 6012 410 126.7 11.5 TRUE
## 137 NA NA 12.3 66.0 2434 130 16.6 8.7 FALSE
## 138 NA 0.8752711 9.0 57.6 21056 290 112.6 19.7 FALSE
## 139 0.5863636 0.8539720 13.5 60.1 3734 280 125.4 12.7 TRUE
## 140 0.6986090 0.9425770 11.5 61.4 3852 380 58.4 10.9 TRUE
## 141 0.6189189 0.9646018 10.6 66.2 4680 NA 65.0 25.0 FALSE
## 142 0.8256659 0.6825208 10.0 71.6 3191 170 80.6 20.0 TRUE
## 143 0.4323144 0.9109827 10.9 68.4 2949 170 44.3 19.0 TRUE
## 144 NA 0.5822622 11.3 66.5 2918 210 65.1 18.2 FALSE
## 145 0.8057325 0.8591160 11.0 61.6 2762 400 93.6 20.8 TRUE
## 146 0.4633508 0.9173364 12.4 69.6 2311 190 73.7 29.5 TRUE
## 147 0.4186551 0.2967431 7.8 66.2 4866 170 27.3 19.7 TRUE
## 148 1.4967320 0.9137303 8.6 65.9 4608 200 12.1 4.7 TRUE
## 149 NA 0.8231469 11.4 52.3 6822 460 170.2 36.8 FALSE
## 150 0.8423077 0.6131285 11.3 49.0 5542 310 72.0 14.7 TRUE
## 151 0.5894737 0.9767184 9.2 65.0 2411 410 122.7 36.0 TRUE
## 152 NA 0.7566719 9.0 52.8 5341 560 119.6 6.6 FALSE
## 153 0.6103152 0.8307292 10.4 55.5 2803 590 115.8 27.1 TRUE
## 154 NA 0.9569061 10.3 65.1 1328 440 122.8 20.5 FALSE
## 155 0.7854839 0.9275362 10.9 57.5 1615 470 60.3 35.1 TRUE
## 156 0.3971292 0.3628319 8.5 63.1 3560 320 73.3 22.2 TRUE
## 157 NA 0.6759494 9.2 67.9 1540 130 64.9 2.0 FALSE
## 158 0.5241379 0.9527027 9.9 62.6 2463 220 62.1 2.7 TRUE
## 159 NA 0.4394507 11.5 63.3 1456 350 51.1 3.0 FALSE
## 160 0.3220974 0.3518006 9.2 63.8 3519 270 47.0 0.7 TRUE
## 161 1.1526316 0.8027211 11.1 49.8 3306 490 89.4 26.8 TRUE
## 162 0.3995037 0.9913899 12.2 59.7 1228 450 91.5 17.6 TRUE
## 163 0.6363636 0.8577465 8.7 62.8 1669 380 42.0 3.5 TRUE
## 164 0.9090909 1.0128957 10.3 64.2 1458 320 33.6 57.5 TRUE
## 165 0.6835821 0.9570707 9.8 58.5 1613 360 126.6 35.0 TRUE
## 166 0.4185185 0.8633461 11.1 59.6 1767 340 90.2 8.4 TRUE
## 167 0.6648352 0.4118421 7.0 63.5 3809 360 84.0 23.8 TRUE
## 168 NA 0.5361891 6.4 62.0 3276 230 18.6 12.7 FALSE
## 169 NA NA 7.6 55.7 2332 730 75.3 24.3 FALSE
## 170 0.4675325 0.7500000 7.9 66.5 2188 320 94.4 42.7 TRUE
## 171 0.1979866 0.1987421 9.3 60.4 1885 400 86.8 27.6 TRUE
## 172 0.4651163 0.6437346 8.9 51.5 3171 720 130.3 9.2 TRUE
## 173 0.5138889 1.0380368 10.8 62.8 747 510 144.8 16.7 TRUE
## 174 0.4285714 0.8756999 8.5 64.1 1428 420 78.4 25.5 TRUE
## 175 0.5523810 0.8709288 8.8 60.2 1507 430 115.8 9.4 TRUE
## 176 0.3950617 0.9658470 9.8 58.7 680 730 135.3 8.2 TRUE
## 177 0.3918575 0.8981481 9.5 60.9 805 640 117.4 10.7 TRUE
## 178 NA 0.8687898 9.0 55.2 1362 560 99.3 13.7 FALSE
## 179 0.5099338 0.6240786 8.4 58.0 1583 550 175.6 9.5 TRUE
## 180 0.2258065 1.0326087 9.3 55.1 1123 480 137.8 39.6 TRUE
## 181 0.4608295 0.9521739 8.6 50.9 1780 1100 100.7 12.4 TRUE
## 182 NA 0.8378033 8.7 58.8 1096 650 131.0 21.9 FALSE
## 183 0.2812500 0.8566667 7.8 58.7 1591 400 115.4 13.3 TRUE
## 184 0.6385542 1.0158537 10.1 56.7 758 740 30.3 34.9 TRUE
## 185 0.1717172 0.8080808 7.4 51.6 2085 980 152.0 14.9 TRUE
## 186 NA 0.8908686 4.1 63.7 1130 380 65.3 22.0 FALSE
## 187 0.3782772 0.8531140 7.2 50.7 581 880 98.3 12.5 TRUE
## 188 0.3076923 0.4459309 5.4 61.4 908 630 204.8 13.3 TRUE
## 189 0.7289916 0.3081009 12.0 70.6 15722 155 45.4 14.0 TRUE
## 190 0.8250377 0.7884131 12.7 74.0 11449 72 21.2 18.7 TRUE
## 191 0.8784119 0.6514286 13.6 72.3 12791 28 30.8 19.0 TRUE
## 192 0.9836957 0.6729323 14.0 75.0 14242 85 68.3 27.0 TRUE
## 193 0.5329670 0.3711083 11.2 68.4 5605 183 38.7 17.5 TRUE
## 194 0.7015873 0.8537859 9.6 58.5 3363 506 109.7 22.5 TRUE
## 195 0.8333333 0.6558018 12.2 71.5 14301 210 47.4 21.8 TRUE
## [1] 155 8
Following is the summary of the current variables in the human data.
## secedfm labfm eduexp lifeexp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## gni matmor adobir fparli
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Now let’s visualize these variables.
varlabels <- c("female to male secondary educ. ratio",
               "female to male labor force ratio",
               "expected years of schooling",
               "life expectancy at birth",
               "gross national income",
               "maternal mortality",
               "adolescent birth rate",
               "percentage of women in parliament")
# Draw a histogram for each variable. The binwidth follows Scott's rule,
# 3.49 * sd(x) * n^(-1/3); with n = 155 countries, n^(1/3) is roughly 5.4.
for (i in seq_along(human)) {
  var <- human[[i]]
  plot <- qplot(var,
                geom = "histogram",
                binwidth = (3.49 * sd(var)) / 5.4,
                main = paste("Histogram for", varlabels[i]),
                xlab = varlabels[i],
                ylab = "Number of countries",
                fill = I("blue"),
                col = I("red"),
                alpha = I(.2))
  print(plot)
}
We can see that in most countries a lower proportion of women than men had secondary education. However, there were exceptions where a higher proportion of women than men had secondary education; upon closer inspection, these were Gabon, Myanmar, and Guyana. In most countries there were about 20 to 30 percent fewer women than men in the labor force. Thirteen to fifteen years of schooling was common in most countries. Life expectancy at birth in most countries was about 75 years. GNI per capita was right skewed. Maternal mortality was also right skewed, but very high in some countries, up to 1100 maternal deaths per 100,000 live births in Sierra Leone. The percentage of women in parliament was roughly normally distributed with a peak at 20 percent.
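The exceptional countries can be identified directly from the data. A small sketch, assuming the country names are stored as row names of `human` (as set up during the wrangling):

```r
# countries where proportionally more women than men have secondary education
rownames(human)[which(human$secedfm > 1)]
```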
Let's check the correlations between these variables.
cor_matrix<-cor(human)
corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Strong negative correlations were found between (1) maternal mortality and life expectancy at birth, (2) expected years of education and maternal mortality, and (3) adolescent birth rate and life expectancy. Strong positive correlations were found between (1) adolescent birth rate and maternal mortality and (2) expected years of education and life expectancy.
pca_human <- prcomp(human)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
### Principal component analysis (after standardization)
human_std <- scale(human)
pca_human <- prcomp(human_std)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.5, 1), cex.axis = 0.7, col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main = "PCA biplot", cex.sub = 0.8, sub = "after standardizing the variables, GNI is no longer the strongest contributing variable in the PCA analysis, because scaling makes it comparable in size to the other variables.")
The results are very different after standardization because standardization changes the scale of the data. Before standardization, one variable (GNI) had far more variability than the others and was therefore given the highest weight; PC1 was based almost entirely on the variability within gni. After standardization, the other variables could also influence the results.
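The effect of scaling can be made concrete by comparing the column variances, since unscaled PCA is driven by whichever variable has the largest variance:

```r
# column variances before standardization: GNI dwarfs everything else
round(apply(human, 2, var), 1)

# after scaling, every variable has unit variance
round(apply(scale(human), 2, var), 1)
```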
According to my interpretation, PC1 is the strongest principal component (i.e. it captures the largest share of the variability, about 53%). It is influenced by maternal mortality and adolescent birth rate in one direction, and by expected years of education, life expectancy, the female-to-male secondary education ratio, and GNI in the other direction.
The second principal component captures 16 percent of the variability in the data. It is influenced by the percentage of women in parliament and the participation of women in the labor force.
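This interpretation can be checked against the loadings stored by `prcomp`. A short sketch (note that the signs of loadings may flip between platforms; only their relative directions matter):

```r
# loadings of the first two principal components
round(pca_human$rotation[, 1:2], 2)
```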
data("tea")
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
The tea dataset is part of the FactoMineR package and contains information about the tea drinking preferences of 300 individuals.
Following is a distribution of participants according to their age and gender.
age <- tea$age
# The binwidth again follows Scott's rule; with n = 300, n^(1/3) is roughly 6.69.
# Note: the original code used sd(var), a leftover from the earlier loop; it
# should be sd(age).
plot <- qplot(age,
              geom = "histogram",
              binwidth = (3.49 * sd(age)) / 6.69,
              main = "Age and sex distribution of participants",
              xlab = "Age",
              fill = tea$sex,
              col = I("red"),
              alpha = I(.2)) + scale_fill_discrete(name = "Gender")
print(plot)
The following graphs describe the characteristics of the individuals in terms of key variable groups. First we explore the time preference for consuming tea, i.e. lunch, evening, or dinner. In the second group we explore where tea is usually consumed and where it is purchased. In the third group we explore how the tea is consumed.
selcols <- c("evening", "dinner", "lunch")
group1 <-select(tea, one_of(selcols))
gather(group1) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
selcols <- c("home", "pub", "work", "tearoom", "where")
group2 <-select(tea, one_of(selcols))
gather(group2) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
selcols <- c("How", "sugar", "how")
group2 <-select(tea, one_of(selcols))
gather(group2) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
I chose certain groups of variables for the analyses.
# candidate variables: Tea, evening, dinner, lunch, sugar, where, home, pub,
# work, tearoom, how, How
groups <- list(c("Tea", "evening", "dinner", "lunch"),
               c("Tea", "sugar", "how"))

# run a multiple correspondence analysis (MCA) for each group of variables
for (curgrp in groups) {
  tea_mca <- select(tea, one_of(curgrp))
  mca <- MCA(tea_mca, graph = FALSE)
  p <- plot(mca, invisible = c("ind"), habillage = "quali")
  print(p)
}
I first wanted to investigate whether the time of consuming tea is related to the type of tea consumed. I found that green tea is usually consumed at dinner, while Earl Grey and black tea are usually not. People were most likely not to consume tea with lunch.
In the second group of variables, I found that Earl Grey is usually consumed in tea bag form and with sugar. Green tea is bought both packaged and unpackaged, and black tea is mostly bought unpackaged. Sugar is usually not added to green or black tea.
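How much of the variation the MCA dimensions actually retain can be checked from the eigenvalues that FactoMineR stores in the result object. A quick sketch, using the `mca` object fitted above:

```r
# percentage of variance retained by each MCA dimension
head(mca$eig)
```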
Let's load the required packages.
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(lme4)
library(lmerTest)
The BPRS data contain measurements from 40 males, equally assigned to two treatment groups (20 in each group). The Brief Psychiatric Rating Scale (BPRS) score was measured at baseline, i.e. before treatment, and then weekly until week 8. The BPRS is used to evaluate patients suspected of having schizophrenia.
The RATS data come from a nutrition study carried out on rats. Each group was put on a different diet and the rats' weights were followed for a period of 9 weeks. Measurements were taken weekly, except for one week in which two measurements were taken (days 43 and 44).
Both datasets have been converted into long form. The script for creating the long-form data can be found in the data folder, and the difference between the wide and long forms is explained in the script file.
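The conversion itself can be sketched roughly as below for BPRS (a sketch only; the actual script in the data folder is the authoritative version). Each weekly column of the wide data becomes one row per subject per week:

```r
library(dplyr)
library(tidyr)

BPRSL <- BPRS %>%
  # collapse the week0...week8 columns into key/value pairs
  gather(key = weeks, value = bprs, starts_with("week")) %>%
  # extract the numeric week from strings like "week0" ... "week8"
  mutate(week = as.integer(substr(weeks, 5, 5)))
```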
Let's read the BPRS and RATS datasets created using the instructions from the data wrangling exercise.
BPRS <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep=" ")
RATS <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep='\t')
RATSL <- read.table("data/RATSL.csv", sep=",", header=TRUE, check.names=FALSE, row.names = 1)
BPRSL <- read.table("data/BPRSL.csv", sep=",", header=TRUE, check.names=FALSE, row.names = 1)
glimpse(RATSL)
## Observations: 176
## Variables: 5
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Group <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1...
## $ WD <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, ...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, ...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8...
glimpse(BPRSL)
## Observations: 360
## Variables: 5
## $ treatment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ weeks <fct> week0, week0, week0, week0, week0, week0, week0, wee...
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
As the datasets were saved in csv format, the factor variables were read back as integers. Let's convert them back to factors. This is important so that the software treats them as categorical rather than continuous variables.
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
glimpse(RATSL)
## Observations: 176
## Variables: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1...
## $ WD <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, ...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, ...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8...
glimpse(BPRSL)
## Observations: 360
## Variables: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ weeks <fct> week0, week0, week0, week0, week0, week0, week0, wee...
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
Now we will work with the RATS dataset and, as instructed, carry out the analyses from chapter 8 of Kimmo's textbook, but using the RATS dataset instead of BPRS.
Let’s visualize the data.
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times = 4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
Here we can already see that the three groups of rats differ in baseline weight. We can also see an upward trend in weight in all groups.
What is apparent is that the rats with higher weight continued to have high weight throughout the study. This phenomenon is called tracking and we shall try to visualize it by “subtracting the relevant occasion mean from the original observation and then dividing by the corresponding visit standard deviation.” (source, Kimmo’s book)
# Standardizing weight
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdweight = (Weight - mean(Weight))/sd(Weight) ) %>%
ungroup()
# Plotting again with standardized weight
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
geom_line() +
theme(legend.position = "none") +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized weight")
With this many individual profiles, though, the plot is hard to read. It would be better if we could draw a more concise plot showing the mean profile for each group and the variation around these means.
n <- RATSL$Time %>% unique() %>% length()
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = (sd(Weight) /
sqrt(n))) %>%
ungroup()
# Plotting the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.5)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
Another way to visualize the data is through boxplots of each rat's mean weight.
RATSL8S <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
ggplot(RATSL8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), Time 8-64")
We can identify at least one outlier that is affecting the results. Let's remove it and redraw the plot.
RATSL8S1 <- RATSL8S%>% filter(mean < 590 )
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=2, fill = "white") +
scale_y_continuous(name = "mean(Weight), Time 8-64")
Now it is time to test for group differences. The t test is meant for comparisons between two groups, but here we have three, so we will instead use one-way ANOVA for the between-group comparison.
RATSL8S1owa <- aov(mean ~ Group, data = RATSL8S1)
summary(RATSL8S1owa)
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 207659 103830 501.8 2.72e-12 ***
## Residuals 12 2483 207
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA results tell us that the groups differ significantly from one another in mean weight. However, to find out which groups differ, we need to run a post hoc analysis.
TukeyHSD(RATSL8S1owa)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mean ~ Group, data = RATSL8S1)
##
## $Group
## diff lwr upr p adj
## 2-1 187.375 161.39457 213.3554 0.00e+00
## 3-1 262.475 238.97481 285.9752 0.00e+00
## 3-2 75.100 45.79012 104.4099 4.98e-05
Results from the Tukey post hoc test show that all three pairwise group differences are significant, with the largest differences between group 1 and groups 2 and 3.
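As a cross-check (not part of the original analysis), the same pairwise comparisons can be run with Bonferroni-corrected t tests on the group means:

```r
# pairwise t tests between diet groups, Bonferroni-adjusted p values
pairwise.t.test(RATSL8S1$mean, RATSL8S1$Group, p.adjust.method = "bonferroni")
```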
But the problem is that we have not adjusted the model for baseline weight. That can be an important factor, considering that the rats in group 1 had a markedly lower weight at the start of the study (as seen in the plots).
Let's run a linear regression model with the baseline value as one of the predictors.
RATS
## ID Group WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## 1 1 1 240 250 255 260 262 258 266 266 265 272 278
## 2 2 1 225 230 230 232 240 240 243 244 238 247 245
## 3 3 1 245 250 250 255 262 265 267 267 264 268 269
## 4 4 1 260 255 255 265 265 268 270 272 274 273 275
## 5 5 1 255 260 255 270 270 273 274 273 276 278 280
## 6 6 1 260 265 270 275 275 277 278 278 284 279 281
## 7 7 1 275 275 260 270 273 274 276 271 282 281 284
## 8 8 1 245 255 260 268 270 265 265 267 273 274 278
## 9 9 2 410 415 425 428 438 443 442 446 456 468 478
## 10 10 2 405 420 430 440 448 460 458 464 475 484 496
## 11 11 2 445 445 450 452 455 455 451 450 462 466 472
## 12 12 2 555 560 565 580 590 597 595 595 612 618 628
## 13 13 3 470 465 475 485 487 493 493 504 507 518 525
## 14 14 3 535 525 530 533 535 540 525 530 543 544 559
## 15 15 3 520 525 530 540 543 546 538 544 553 555 548
## 16 16 3 510 510 520 515 530 538 535 542 550 553 569
# Creating a new variable
RATSL8S2 <- RATSL8S %>%
mutate(baseline = RATS$WD1)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = RATSL8S2)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After adjusting for baseline weight, the group differences are no longer significant! Thus the nutritional interventions did not cause a difference in weight gain between the groups.
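The way a strong baseline effect can absorb apparent group differences can be sketched with simulated data (hypothetical numbers, not the RATS data): the groups start at very different weights but gain weight by the same rule, so group adds nothing once baseline is in the model.

```r
# Simulated sketch of baseline adjustment (hypothetical data, not RATS):
# three groups with very different starting weights, identical growth rule
set.seed(42)
n <- 30
sim <- data.frame(
  group    = factor(rep(1:3, each = n)),
  baseline = c(rnorm(n, 250, 10), rnorm(n, 450, 10), rnorm(n, 520, 10))
)
sim$final <- 1.1 * sim$baseline + rnorm(3 * n, 0, 5)  # no true group effect

fit_sim <- anova(lm(final ~ baseline + group, data = sim))
fit_sim  # baseline dominates; group explains little extra variance
```

Without the baseline term, group would look hugely significant here, purely because the groups started from different weights.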
In this section we will use linear mixed-effects models for the analysis. These models add random effects (between-subject and within-subject variation in the repeated measures) to the predictors in the model. Repeated measures on the same subject are correlated with each other, but once these random effects are accounted for, the remaining residuals can be treated as independent.
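The correlation induced by repeated measures can be shown with a quick simulation (purely illustrative data): two measurements that share a subject-level intercept are strongly correlated even though their measurement noise is independent.

```r
# Sketch of why repeated measures are correlated: two measurements on
# the same subject share that subject's random intercept
# (simulated data, purely illustrative)
set.seed(1)
n_subj <- 50
b  <- rnorm(n_subj, mean = 0, sd = 10)  # subject-level random intercepts
y1 <- b + rnorm(n_subj, sd = 2)         # measurement at time 1
y2 <- b + rnorm(n_subj, sd = 2)         # measurement at time 2
cor(y1, y2)  # close to 100 / (100 + 4), the intraclass correlation
```

This within-subject correlation is exactly what the random effects in the mixed models below are meant to absorb.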
First, let’s visualize the dataset in its long form. The graph shows BPRS scores for the individuals in both treatment groups, and I have drawn a loess curve to show the overall trend in each group. Let’s also make a scatterplot matrix of the weekly BPRS scores.
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
geom_line() +
stat_summary(aes(group = 1), geom = "point", fun.y = mean, shape = 17, size = 3) +
facet_grid(. ~ treatment) +
scale_x_continuous(name = "Time (weeks)") +
scale_y_continuous(name = "BPRS") +
theme(legend.position = "none") +
stat_smooth(aes(group = 1))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
pairs(BPRS[3:11], col = BPRS$treatment)
Let’s first run a simple linear regression that naively assumes these repeated measures are independent.
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
The simple regression model shows that BPRS decreases significantly over time (week), but there is no significant difference between the treatment groups.
Let’s now run a random intercept model, which acknowledges the repeated nature of the measures and allows the intercept of the regression fit to differ for each subject.
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood . t-tests use
## Satterthwaite's method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 46.4539 1.9090 37.2392 24.334 <2e-16 ***
## week -2.2704 0.2084 340.0000 -10.896 <2e-16 ***
## treatment2 0.5722 1.0761 340.0000 0.532 0.595
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
Note that the model now accounts for the variation between individuals through the subject-specific random intercept.
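From the variance components in the summary above we can compute the intraclass correlation, i.e. the share of the total residual variance that is due to differences between subjects:

```r
# Intraclass correlation from the random-intercept model's variance
# components (values taken from the summary output above)
var_subject  <- 47.41    # between-subject (intercept) variance
var_residual <- 104.21   # residual variance
icc <- var_subject / (var_subject + var_residual)
icc  # about 0.31: roughly a third of the variance is between subjects
```

An ICC this far from zero confirms that the repeated measures are substantially correlated within subjects, so the random effect is doing real work.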
Now let’s add time (week) to the random effects, giving each subject a random slope as well.
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood . t-tests use
## Satterthwaite's method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7977
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9803 -0.51
## Residual 97.4304 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 46.4539 2.1052 22.6796 22.066 < 2e-16 ***
## week -2.2704 0.2977 19.9985 -7.626 2.42e-07 ***
## treatment2 0.5722 1.0405 320.0007 0.550 0.583
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compared to the previous model, the new model that includes a random slope for week in addition to the random intercept fits significantly better (p ≈ 0.026).
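The likelihood-ratio test that anova() performs here can be reproduced by hand from the two deviances reported above: the test statistic is the drop in deviance, and the degrees of freedom are the two extra parameters (the random slope variance and its correlation with the intercept).

```r
# Likelihood-ratio test by hand, using the deviances reported above
dev_ref  <- 2738.7   # deviance of the random-intercept model
dev_ref1 <- 2731.4   # deviance of the random-intercept + slope model

chisq <- dev_ref - dev_ref1                       # matches the Chisq column
p_val <- pchisq(chisq, df = 2, lower.tail = FALSE)
p_val  # about 0.026, as in the anova() table
```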
Finally, let’s also add the interaction between week and treatment group as a fixed effect.
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week * treatment)+ (week|subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood . t-tests use
## Satterthwaite's method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (week * treatment) + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 47.8856 2.2521 29.6312 21.262 < 2e-16 ***
## week -2.6283 0.3589 41.7201 -7.323 5.24e-09 ***
## treatment2 -2.2911 1.9090 319.9977 -1.200 0.2310
## week:treatment2 0.7158 0.4010 319.9977 1.785 0.0752 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + (week * treatment) + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This does not significantly improve the model, which suggests that there is little or no interaction between the treatment group and week variables.
In this model, as in all the previous ones, there was no significant difference in BPRS between the treatment groups.
Let’s now draw a new plot based on our fitted model and compare it with the plot we drew at the beginning.
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
geom_line() +
stat_summary(aes(group = 1), geom = "point", fun.y = mean, shape = 17, size = 3) +
facet_grid(. ~ treatment) +
scale_x_continuous(name = "Time (weeks)") +
scale_y_continuous(name = "BPRS") +
theme(legend.position = "none") +
stat_smooth(aes(group = 1))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Fitted <- fitted(BPRS_ref2)
BPRSL <- mutate(BPRSL, Fitted = Fitted)
ggplot(BPRSL, aes(x = week, y = Fitted, group = subject)) +
geom_line() +
stat_summary(aes(group = 1), geom = "point", fun.y = mean, shape = 17, size = 3) +
facet_grid(. ~ treatment) +
scale_x_continuous(name = "Time (weeks)") +
scale_y_continuous(name = "BPRS") +
theme(legend.position = "none") +
stat_smooth(aes(group = 1))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Notice how the model fits a regression line for each individual, and how these fitted lines smooth out the noise in the raw observations. One can also see visually that there is no difference in the BPRS trend between the treatment groups.